A document marker (27) including first values 
dependent upon the layout and the contents of 
the document and assigned by generating or 
preprocessing software (21) is provided in 
machine-readable symbology on the face of a 
printed version (24) of the document. The mar- 
ker (27) may include encoded document layout 
information and values assigned on sequences 
of the original text, including text-dependent 
decimation sequences, error correction codes 
or check-sums. Upon optical character recogni- 
tion scanning (16), or other digitizing reproduc- 
tion, the marker (27) is also scanned. The 
scanning computer (28), having corresponding 
software (29,26), assigns second values dependent 
upon the layout and contents of the reproduced 
document. Upon comparison of the first 
and second decimation sequences, line and 
character errors can be detected and some 
errors corrected, thereby generating re-aligned 
candidate sequences. Optional error correction 
codes can provide further correcting capabili- 
ties, as applied to the re-aligned reproduced 
document sequences, and an optional check-sum 
comparison can be utilized to verify that 
the reproduced sequences are accurate. 
This invention relates to the use of automatically generated document markers, and this Application is related 
to our European Patent Application No. 93308020.2. More particularly, the invention relates to the use 
of markers of encoded information incorporated into each page of a document for providing a means for not 
only document identification and document structure recognition, but also error detection and error correction 
when the marked documents are reproduced using optical character recognition technology. 

Background of the Invention 

The identification of products using computer readable bar codes, wherein digital data is recorded directly 
on paper, provides for item identification given a fixed set of values using simple numeric encoding and scanning 
technologies. Identification of computer generated and stored documents is another technology which 
has been developed using binary encoding to identify and provide for retrieval of stored documents. Most 
document-generating software programs provide not only identification and/or retrieval information for the 
document, but also include encoded information for provision to an associated printer specifying, for example, such 
details as spacing, margins and related layout information. Once the document has been printed on paper, 
however, that information no longer accompanies the document, other than as discerned by the user. If it is desired 
to reproduce the document using an optical character recognition (OCR) system, there is no automatic means 
by which to communicate the layout information through the scanner and to the receiving computer. A desirable 
extension of the identification technology would be, therefore, the provision of a means for generating a paper 
version of a document which can be recognized, reproduced and proofread by a computer by optically scanning 
a marker incorporated in or on the paper document in conjunction with the OCR text scanning of the document. 

Document or product identification systems which have been employed in the past include barcode markers 
and scanners which have found use in a wide range of arenas. With respect to paper documents, special 
marks or patterns in the paper have been used to provide information to a related piece of equipment, for example 
the job control sheet for image processing as taught by Hikawa in U.S. Patent No. 5,051,779. Similarly, 
identifying marks have been incorporated into forms as described in U.S. Patent No. 5,060,980 of Johnson, 
et al. The Johnson, et al. system provides for the editing of forms which are already resident in the computer. 
A paper copy of the form is edited by the user and then scanned to provide insertions to the fields of the duplicate 
form that is stored electronically in the computer. Still another recently patented system is described 
in U.S. Patent 5,091,966 of Bloomberg, et al., which teaches the decoding of glyph shape codes, which codes 
are digitally encoded data on paper. The identifying codes can be read by the computer and thereby facilitate 
computer handling of the document, such as identifying, retrieving and transmitting the document. The systems 
described in the art do not incorporate text error detection or correction schemes. Further, the systems 
require that the associated computers have a copy of the document of interest in memory prior to the input 
of information via the scanning. The systems cannot be applied to documents which are being created in the 
scanning computer by OCR. 

Optical character recognition systems, as illustrated schematically in the accompanying Figure 1, generally 
include a digitizing scanner, 16, and associated "scanning" computer, 18, for scanning a printed page, 14, 
which was generated by an originating computer, 12, and output by a printer, 13. The scanner, 16, extracts 
the text to be saved, as electronic document, 15, in a standard electronic format, such as ASCII. What is desirable 
is to additionally incorporate information about the text and layout for error detection and correction, 
which information can be optically scanned or otherwise automatically input. 

Due to the inherent limitations in both the scanning process and the ability of an optical character recognition 
system to effect accurate character recognition, errors are introduced into the output, including not only 
character misinterpretation errors but also layout-dependent errors. The typical character misinterpretation errors 
which occur in the OCR reproduction of documents include the following: substitution errors, wherein 
erroneously-identified characters are substituted for the actual printed characters (e.g., "h" for "b", wherein "the 
bat" becomes "the hat"); deletion errors, wherein characters or spaces are erroneously omitted from the scanned 
region (e.g., "the bat" becomes "that"); and, insertion errors wherein characters or spaces are erroneously 
inserted into the reproduced region (e.g., "the bat" becomes "t, he b at"). In addition, a common error can, in 
fact, be a combination of these basic error types (e.g., reading "rn" for "m" involves a substitution and an insertion, 
while reading "H" for "f1" involves a substitution and a deletion). In addition, entire lines of text can be 
inserted or deleted in the course of OCR scanning and reproduction. Traditional error detection/correction 
schemes generally operate to detect and correct substitution errors but are ineffectual at detecting and correcting 
deletion and insertion errors of the kind encountered in OCR, as further discussed herein. 

Post-processing, specifically error detection and correction, must then be performed, primarily by human 
proofreading of the reproduced document. Errors in layout are ordinarily not automatically rectifiable by the 
computer but, rather, require extensive, user-intensive editing or possibly re-creation of the document. The hu- 
CII text, and prints the check-sum symbol or symbols at the end of each printed line of text in the document. 
Upon OCR scanning of the printed line, the printed check-sum symbol is also scanned and ". . . processed in 
a routine manner to produce an ASCII code serial bit stream . . ." Upon reproduction of the printed line, a check-sum 
value for the reproduced line of text is calculated and compared to the scanned symbol. If the two check-sums 
do not match, the existence of an error is assumed, the line is rescanned, and the process is repeated 
until a match is found, if ever. No intra-line error location can be realized by the McGinn system, nor can actual 
correction of a detected error be conducted short of rescanning and reproducing the line, if even then. 
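
The per-line check-sum comparison described above can be sketched as follows. This is a minimal illustration and not the McGinn implementation: the check-sum formula (sum of the ASCII codes modulo 256) and the names `line_checksum` and `line_matches` are assumptions chosen for brevity.

```python
def line_checksum(line: str, modulus: int = 256) -> int:
    """Hypothetical check-sum: sum of the ASCII codes of the line, modulo `modulus`."""
    return sum(line.encode("ascii")) % modulus


def line_matches(scanned_line: str, scanned_checksum: int) -> bool:
    """Recompute the check-sum on the OCR output and compare it to the scanned symbol."""
    return line_checksum(scanned_line) == scanned_checksum


printed_line = "the bat"
printed_checksum = line_checksum(printed_line)  # value printed at the end of the line

# A substitution error ("b" read as "h") changes the recomputed check-sum.
print(line_matches("the hat", printed_checksum))  # False
print(line_matches("the bat", printed_checksum))  # True
```

The comparison yields only a yes/no answer for the whole line, which is consistent with the observation that such a scheme cannot locate an error within the line.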

Since the McGinn system encodes the check-sum symbol using ASCII text, the symbol is optically scanned 
and recognized using the same technology as the standard text. Consequently, error-free location and recognition 
of the check-sum symbol cannot be guaranteed. The recognition system may not be able to distinguish 
the symbol from the line text. Moreover, the symbol may be erroneously identified. A difference between the 
scanned symbol and the calculated check-sum for the reproduced text may, therefore, be indicative of misinterpretation 
of the check-sum symbol even if accurate reproduction of the scanned text has been achieved. 
Another class of OCR reproduction errors which cannot be accounted for when using the McGinn system is 
the omission or insertion of entire text lines. Absent a corresponding scanned check-sum, the McGinn system 
can neither account for nor correct entire line errors. In effect, therefore, the McGinn system simply confirms 
the accuracy of text reproduced by OCR, as opposed to improving that accuracy. 

It is therefore an objective of the present invention to provide a means and method for automatically incorporating 
information markers on a paper document, which information is encoded to provide a variety of 
detail about the document to an associated computer. 

It is another objective of the invention to establish the absence or presence of errors on a page reproduced 
using OCR technology without requiring an entry-by-entry comparison. 

It is another objective of the invention to provide an error detection system and method for precisely locating 
errors on a page reproduced using OCR technology. 

It is still another objective of the invention to provide an error detection system which can be used in conjunction 
with existing error correction systems to precisely locate document errors and compensate for deletion 
and insertion errors before effecting substitution error correction procedures. 

Another objective of the invention is to provide an automatic error correction means and method for documents 
reproduced using OCR technology. 

It is yet another objective of the invention to provide an error detection system which can overlook intentional 
misspellings, abbreviations, etc. 

It is a further objective of the invention to provide an error detection system which can be used with any 
document format, fonts, and related hardware. 

It is yet another objective of the invention to provide a means for providing documents with unique markers 
which can be used to impart various information to computers. 

Still another objective of the present invention is to provide a means and method for supplying documents 
with computer-readable markers which contain information about the document including document structure, 
error identification, location and correction information, and document identification and retrieval information. 

Summary of the Invention 

These and other objectives are realized by a system which implements the creation and incorporation of 
a document marker for documents to be reproduced. The marker can include a variety of information including 
document structure and error detection encoding. The error detection/correction encoding information comprises 
a certificate, including at least one value calculated on the text and incorporated, by one of various encoding 
techniques, into the certificate of the marker provided on the face of the document to be reproduced. 
Upon OCR reproduction of the document, certificate values for the text, as the text appears on the reproduction, 
can be recalculated and then compared to the original certificate values. If the values match, the probability 
is that the reproduction is error-free. If the certificate values do not match, at least one error is present 
in the text as reproduced. The certificate can provide not only error detection, but also error location (for example, 
which character on a line is in error); and can include error correction codes or pointers to traditional 
dictionary lookup and semantic systems. Additional information can be encoded, with the calculated text certificates 
or as separate information in the machine readable markers, to provide information regarding the 
document layout, document identification, document location in the computer system, destinations of computers 
or other interconnected peripherals for transmission of the document, and such other information as may 
be required. 
man post-processing is expensive not only in terms of actual costs but also in the time needed to complete 
the processed document. Optimally, solutions will provide not only a means for detecting character substitution 
errors but also a means for detecting and correcting all of the character and line misinterpretation errors. Further, 
an ideal solution should additionally facilitate identification of the document itself and communicate the 
appropriate layout structure for the document. 

Error detection/correction systems which have been employed in the computer document creation tech- 
nology (e.g., word processing) include techniques based on dictionary lookup and/or attempts to use semantic, 
or context, information extracted from the document in order to identify and correct errors. Many of these sys- 
tems require that entries in the document which do not correlate to an entry in the lexicon will be reviewed by 
a "human post-processor". The automated error correction version of a dictionary-based system will, upon 
identification, spontaneously correct entries which do not correlate to dictionary entries. One can readily en- 
vision a circumstance wherein automatic spelling correction is not desirable, such as in the case of a proper 
name, an intentional misspelling or a newly coined term. The presumption in the use of dictionary-comparison 
versions of such systems is that each entry in the entire document be compared to a data-base dictionary of 
terms. The cost of comparison of each entry of a document to a given lexicon is quite high. Streamlined error 
detection and location, without the need for entry-by-entry comparison, is desirable. 

The use of semantic information extracted from the document is further proposed in the art in order to 
facilitate the identification and automatic correction of errors that have been detected but which cannot be readily 
identified as misspellings of available dictionary terms or which "resemble" more than one available dictionary 
entry. Such a system will recognize and correct the term "of the" to "of the" when a dictionary lookup 
would simply reject the term or miscorrect it. Similarly, a bank of commonly-occurring errors for the hardware 
or software being used, and for the font or fonts being scanned, has been proposed for use with the context, 
or semantic, information in order to identify and automatically correct common errors, such as "rn" being in- 
correctly identified as "m", or the letter "O" being incorrectly identified as the number "0". 

To detect errors without requiring an entry-by-entry lookup, particularly for documents which are trans- 
mitted over extended networks, systems have made use of parity bits transmitted with the data. Once the 
transmission has been effected, a bit count is done on the "new" document. If the calculated bit matches the 
transmitted parity bit, then an error-free transmission is assumed. Such systems, and extensions of the parity 
and check bit concept, as taught in U.S. Patent No. 5,068,854 of Chandran, et al., are useful for detecting errors 
in digitally encoded information. Further extensions of the parity bit concept, such as balanced weight error 
correcting codes, to detect and provide correction of more than a one-bit error are also found in the art, such 
as in U.S. Patent No. 4,965,883 of Kirby. Parity and check bit systems developed for use with binary coded 
information are capable of ascertaining the presence of errors with reasonable accuracy, given the low prob- 
ability of the error bit of an erroneously-received quantity of data matching the check bit of the transmitted 
material. Since the bits are calculated on binary-encoded data, they are most effective for detecting one-bit 
errors, except as modified in the weighted balancing and random checking instances. Generally speaking, however, 
the check and parity bit systems tend to be data-independent methods for assuring error-free transmission 
of computer-to-computer transfers. The check and parity bit systems are not, therefore, considered thorough 
checking systems but merely first screening techniques which are intended for digital-to-digital communications 
and not obviously applicable to analog-to-digital conversions such as optical character recognition. 
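
The parity check discussed above can be illustrated with a short sketch. It assumes a single even-parity bit computed over the whole message, which is only one of the possible arrangements; the function name `parity_bit` and the sample data are hypothetical.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 when the number of set bits in `data` is odd, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2


sent = b"bat"
sent_parity = parity_bit(sent)  # transmitted alongside the data

# Flip one bit of the first byte: the recomputed parity no longer matches.
one_bit_error = bytes([sent[0] ^ 0b00000001]) + sent[1:]
print(parity_bit(one_bit_error) == sent_parity)  # False, error detected

# Flip two bits of the first byte: the parity matches again, so the error is missed.
two_bit_error = bytes([sent[0] ^ 0b00000011]) + sent[1:]
print(parity_bit(two_bit_error) == sent_parity)  # True, error not detected
```

A single parity bit therefore serves as a first screen for one-bit errors and says nothing about where in the data the error lies.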

A further prior art system, providing a 16-bit check sequence which is data-dependent and calculated on 
the contents of the data field, is found in U.S. Patent No. 4,964,127 of Calvignac, et al. Once again, the system 
is applied to data which is transmitted along a data path, presumably in digital format. 

In the field of optical character recognition (OCR), there is a similar need to provide the means for detecting 
and correcting errors in data which has been reproduced from optical scanning, bit mapping and computer en- 
coding. Both dictionary lookup and common-error reference have been proposed for use in the OCR context. 
However, as with the document creation needs of the past, the entry-by-entry checking is inefficient and not 
guaranteed to produce the correct result. Moreover, in addition to the printed words, the document layout is a 
critical feature in OCR. The use of current parity bit check systems in an optically-scanned, bit mapped system 
is only nominally effective for error detection, relatively ineffective for error location and totally ineffective for 
detection and correction of improper layout. 

Apparatus for identifying and correcting "unrecognizable" characters in OCR machines is taught in U.S. 
Patent No. 4,974,260 of Rudak. In that system, the characters which are not recognized, in the electronic dic- 
tionary lookup operation, are selectively displayed for an operator to effect interpretation and correction. More 
fully automated OCR error detection and correction is desirable, but not currently available. 

U.S. Patent No. 4,105,997 of McGinn, entitled "Method For Achieving Accurate Optical Character Reading 
of Printed Text" provides a basic error detection scheme for checking the accuracy of text reproduced using 
optical character recognition. The McGinn system calculates a check-sum value for each line of data using AS- 
certificate generator or certificate verifier is incorporated into the overall system and need not be a separate 
program. In addition, the calculating of certificates for original or scanned data is not necessarily a separate 
process step, but may be conducted concurrently with the creating and/or saving of the data. Once calculated, 
the new certificate values may be compared to the original certificate values scanned from the printed page. 
If the new and the original certificate values match, the translation is deemed complete and presumed to be 
error free with high probability. If there is a mismatch, the OCR certificate verifier software can detect and 
correct a small number of errors, given its own or one or more of the known error correction schemes used by 
current document creation or OCR systems, as discussed above. 
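
The generate-then-verify flow described above can be sketched as follows. The sketch substitutes a CRC-32 from Python's standard `zlib` module for the certificate algorithm, which is an assumption made purely for illustration; the function names are hypothetical, and the sketch assumes the OCR output has the same number of lines as the original.

```python
import zlib
from typing import List


def generate_certificates(lines: List[str]) -> List[int]:
    """Stand-in certificate generator: one CRC-32 value per line of text."""
    return [zlib.crc32(line.encode("utf-8")) for line in lines]


def mismatched_lines(ocr_lines: List[str], original_certificates: List[int]) -> List[int]:
    """Stand-in certificate verifier: recompute with the same algorithm and compare."""
    recomputed = generate_certificates(ocr_lines)
    return [
        index
        for index, (new, old) in enumerate(zip(recomputed, original_certificates))
        if new != old
    ]


original_lines = ["the bat", "sat down"]
certificates = generate_certificates(original_lines)  # carried in the printed marker

ocr_lines = ["the hat", "sat down"]  # "b" misread as "h" on the first line
print(mismatched_lines(ocr_lines, certificates))       # [0]
print(mismatched_lines(original_lines, certificates))  # []
```

An empty result corresponds to the case in which the reproduction is presumed error free; a non-empty result identifies the lines that would be handed to a correction step.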

As illustrated in Figure 6, the document marker may include several kinds of encoded data-dependent 
document information, including the page structure encoding (i.e., document layout information) and one or 
more certificates for the text. The certificates may be calculated on a per line basis, as is illustrated, or may 
be calculated on a block of text, which block may encompass all or some portion of the page. Several methods 
of encoding the text for inclusion in the certificate, including the decimation function referenced in Figure 6, 
are detailed below. In addition to the text decimation encoding for the line, the certificate may include an optional 
error correction code and an optional check-sum, either of which may also be utilized in ascertaining 
the accuracy of the reproduction and in correcting same. It is further to be noted that the certificate "components", 
e.g., the decimation string, the error correction codes and the check-sum, need not be calculated on 
the same amount of material on the page. As suggested by the drawing, the certificate may include encodings 
of all three values for each line, wherein generation of the certificate for a block of text would involve steps of 
calculating a decimation for the block of text, providing an error correcting code for the text (or a pointer telling 
the scanning computer to invoke certain known error correction lookup tables or the like), and providing a 
check-sum calculated on the block of text. As an alternative, the various certificate values may be calculated 
on different sized blocks of text; for example the decimation may be conducted for a line of text, while the check-sum 
could be calculated on a paragraph or on the entire page contents, or such other variation as is practicable 
and clearly envisioned by the present description and claims. 
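
One way to picture the marker contents described above is as a small record structure. The following is a sketch only: all class and field names, and the "line"/"paragraph"/"page" scope labels, are hypothetical, and the encoded values are treated as opaque bytes because the encodings themselves are described elsewhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CertificateComponent:
    """One encoded value plus the block of text it was calculated on."""
    scope: str        # hypothetical labels: "line", "paragraph" or "page"
    block_index: int  # which line, paragraph or page the value covers
    value: bytes      # opaque encoded value


@dataclass
class Certificate:
    """The three components need not cover the same amount of text."""
    decimation: Optional[CertificateComponent] = None
    error_correction: Optional[CertificateComponent] = None  # code, or a pointer to lookup tables
    checksum: Optional[CertificateComponent] = None


@dataclass
class DocumentMarker:
    page_structure: bytes  # encoded document layout information
    certificates: List[Certificate] = field(default_factory=list)


# Decimation per line, check-sum over the whole page, no error correction code.
marker = DocumentMarker(
    page_structure=b"",
    certificates=[
        Certificate(
            decimation=CertificateComponent("line", 0, b"\x01\x02"),
            checksum=CertificateComponent("page", 0, b"\x98"),
        )
    ],
)
print(marker.certificates[0].checksum.scope)  # page
```

Making every component optional mirrors the statement that the error correction code and check-sum are optional additions to the decimation encoding.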

The error correction codes which may be incorporated into the certificate portion of the document markers 
may be chosen to address the typical misinterpretation errors which can be expected given the text, the print 
codes, fonts, etc. and the anticipated scanner technology. If particular errors, such as the standard font characterization 
errors mentioned in the background section of the application, are expected, those specific errors 
can be accounted for in the certificate for the given text. In the alternative, the certificate can include a pointer 
directing the scanning computer to the applicable error correction lookup tables resident therein. It is conceivable 
that the anticipated character misinterpretation errors for the available technology will be so numerous 
as to render the text uncorrectable, given a lack of similarity to expected characters and spacings. In such an 
instance, it would be most advantageous to encode the entirety of the text, or a compressed version thereof, 
in the certificate. 

As mentioned above, the marker may be, and preferably is, provided on the face of the printed document 
using a technology other than standard printed characters. Given the problem at hand, the less than perfect 
ability of OCR to reproduce printed characters, a more highly machine-readable and reproducible technology 
such as barcode symbology is preferably employed when providing the marker on the surface of the document. 
Use of a more reliable symbology will not only promise more accurate interpretation of the symbol itself, but 
can also include internal error correction mechanisms for further ensuring accurate reading of the marker. The 
scanning computer can be pre-programmed to locate the marker in a pre-determined location on the page, or 
can search each page it encounters for the document marker. It is not necessary that the marker be readable 
by, or even perceptible to, a human user of the document. The marker can, in fact, be provided in a symbology 
which is invisible on the face of the page, yet still perceptible to the scanner. 

Once the marker has been discerned and decoded by the scanning machine, the certificate values can 
be used to verify the accuracy of the reproduced text. A first level of error detection is the decimation and re-alignment 
function, which can detect and correct insertion errors and can detect deletion errors and convert 
them into substitution errors, thereby generating at least one partially corrected candidate string of text, as 
will be further detailed below. After the decimation and re-alignment function, if error correction information 
has been encoded in the certificate, it may be invoked to address any substitution errors which may be found 
in a given re-aligned candidate. Further, either prior to, in lieu of, or after an iteration of substitution error correction 
has been completed, if a certificate check-sum is available, a check-sum for the corrected, reproduced 
text can be calculated and compared to the originally scanned check-sum for the relevant text block. If the certificate 
does not include any error correction codes, but does have a check-sum for the original text, a check-sum 
may be calculated for a re-aligned candidate sequence without conducting any error correction beyond 
that achieved by the decimation and re-alignment function. In either instance, successive candidate sequences 
can be tried if the initially generated one is not fully corrected. Clearly, the order of invoking the levels of com- 
Brief Description of the Drawings 

The invention will now be described in greater detail with reference to the accompanying drawings wherein: 

Figure 1 schematically illustrates the prior art OCR method of scanning and reproducing a document. 

Figure 2 schematically illustrates the OCR method for reproducing a document with markers having certificates 
to provide error detection and correction. 

Figure 3 illustrates a document generated in accordance with the present invention. 

Figure 4 illustrates a complex document containing diagrams, text blocks and photographs. 

Figures 5A through 5F illustrate one scheme for encoding the layout of the complex document illustrated 
in Figure 4. 

Figure 6 schematically illustrates the contents of a document marker in accordance with the present invention. 

Figure 7 illustrates a traceback table created by the edit distance function described herein. 

Figures 8A and 8B illustrate alignment of full text and decimated line sequences for a printed sentence 
and the OCR reproduction of same. 

Figure 9 provides a representative flow chart of the processing steps performed by the certifying software 
when utilizing a certificate having the decimation, error correction code and check-sum information of the present 
invention encoded therein. 

Figure 10 illustrates a table of edit distance values for determining the correspondence of lines of OCR 
reproduced text to the original lines of text and locating any full-line deletions and insertions. 

Detailed Description of the Preferred Embodiments 

In accordance with the present invention, markers are created for paper documents which may contain 
data-dependent document information, including, but not limited to, a "certificate" encoding error detection and 
error correction information, and a document layout code, for communication to a "scanning" computer and 
use by the scanning computer upon reproduction of the document using OCR technology. 

When creating the certificate component of a marker during the computer generation or preprocessing of 
an original document, the certificate generator 21 of the originating computer, 22 as illustrated in Figure 2, 
calculates one or many data-dependent certificates, 27, with an appropriate algorithm, several examples of 
which are detailed below. A certificate is a succinct key encoding information about the contents of the page, 
produced upon generation of the document or at print time, and recognisable by the OCR software, 29, associated 
with the scanning computer, 28. Any document generated on a computer can have a marker including 
at least one certificate appended to or associated with each text block or page. The process of generating the 
marker requires no human intervention, and only a small added computational cost. As illustrated in Figure 3, 
the document 34, as generated as a printed page or in another medium, is comprised of an area 35, formatted 
primarily for human use, and an area 37 formatted for machine use to assist the machine in its 'understanding' 
of the so-called 'human' area. The human area is the analog portion of the document and the machine 
area is the digital portion of the document. The distinction is used to designate the use made of the portions 
of the document rather than the specific embodiments. The two portions can, and preferably would, be printed 
using the same technology. As an example, the 'digital' portion, i.e., the marker, can be printed using a special 
font, bar code or other symbology which may or may not be 'readable' to the human user, but which is chosen 
to facilitate computer readability. The marker is intended to provide information to the OCR software so that 
it becomes possible to produce a perfectly transcribed digital copy of the original printed page. 

The marker that is computed and printed on the page contains information about the contents of that page. 

The originating computer, 22 of Figure 2, includes certifying software, 21, referred to as the certificate generator. 
Once the document has been created, or in the process of the creation thereof, the certifying software 
calculates one or more certificates based upon the information in and on the document. It is to be noted that 
certificate values for the original document need not have been assigned upon creation of the original document, 
but can be created by preprocessing the original document through the certifying software prior to printing. 
The generated marker, including the one or more calculated certificates, 27, is produced as a machine 
readable part of the hardcopy, 24, of the document which is output by the printer, 13. As in the prior art, the 
hard copy to be reproduced is scanned using a digitizing scanner, 16, which is associated with a second computer, 
28, equipped with the OCR software. The original marker, 27, is also optically scanned and saved by 
the scanning computer. After the OCR document 25 has been created, the OCR software uses the same algorithm 
as that used by the original printing software to calculate one or more certificate values for the extracted 
text. The illustration provides the "certificate verifier", 26, as a separate part of the computer 28 and 
the "certificate generator", 21, as part of computer 22. As would be evident to one having skill in the art, the 
in the art will readily recognize alternative computation methods which fall within the scope of the present invention. 
One having skill will additionally recognize that the size of the check-sum and the computation method 
can be varied to increase or decrease the probability of error detection as required by the specific application. 

Another preferred text encoding scheme of the present invention, referred to above as the "decimation 
and re-alignment" function, can effectively detect and correct insertion errors and detect deletion errors, converting 
them into substitution errors to be addressed in a subsequent error correction step. Due to the lack of 
correspondence between characters in the original and the reproduced versions of a document, deletions and 
insertions are not readily addressed by the known substitution algorithms which identify recognized character 
errors in an arbitrary data stream and then provide alternative character sequences for same. Substitution 
algorithms cannot account for or recognize the existence of insertions or deletions. The first phase, therefore, 
of the re-alignment function is to enforce correspondence between characters in the original and the reproduced 
data streams. 
For the sentence having the original text character sequence:
"The quick brown fox jumped over the lazy dog.",
the following scanned line character sequence may be reproduced:
"The qUick br own fox jurnped ovcr the lazydog."
Note that the OCR reproduced line is one character longer than the original line of text. This violates the alignment
assumption underlying traditional error correcting codes. Furthermore, the Hamming distance between
the two lines (i.e., the number of positions in which the two lines differ) is 35. Hence, even if the original line
is augmented with an additional space or other character in order to equalize the lengths of the lines, a traditional
error correcting code would have to be able to handle up to 35 substitution errors to correct the line,
which is simply not feasible under the presently available technology.
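
The position-by-position comparison behind the Hamming distance can be sketched in C as follows. This is a minimal illustration, not part of the described system: the function name `hamming_distance` is hypothetical, and padding the shorter line with spaces is an assumption taken from the "additional space" remark above.

```c
#include <stddef.h>
#include <string.h>

/* Count the positions at which two lines differ. When the lines have
 * different lengths, the shorter one is treated as if it were padded
 * with trailing spaces (an assumption made for this sketch). */
size_t hamming_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la > lb ? la : lb;
    size_t i, dist = 0;

    for (i = 0; i < n; i++) {
        char ca = i < la ? a[i] : ' ';
        char cb = i < lb ? b[i] : ' ';
        if (ca != cb)
            dist++;
    }
    return dist;
}
```

Because a single inserted or deleted character shifts every later character by one position, nearly every position after the first such error counts as a mismatch, which is why the distance grows so quickly.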

The problem introduced by random insertions and/or deletions is the "sliding" of the original and the reproduced
lines in relation to each other, which increases the Hamming distance so that many more substitution
errors arise. The decimation approach counteracts the effects of deletions and insertions by enforcing correspondence
between the characters on the lines.

To identify deletions and insertions, the well-known concept of approximate string matching is employed.
The relationship between two similar but not necessarily identical lines of text can be made mathematically
precise using an edit model wherein the basic operations of deleting a character, inserting an arbitrary character
and substituting one character for another are used. Each of these operations is assigned a cost, $c_{del}$,
$c_{ins}$ and $c_{sub}$, and the cost of the minimum cost sequence of operations that transforms one string into the other is called
the edit distance. The optimization of edit distance is realized using a well-known dynamic programming algorithm
wherein $s_1, s_2, \ldots, s_i$ are the first $i$ characters of the original line, and $t_1, t_2, \ldots, t_j$ are the first $j$ characters in the
OCR reproduced line. Defining $d_{i,j}$ to be the distance between the two substrings, the dynamic programming
recurrence is:

$$d_{i,j} = \min \begin{cases} d_{i-1,j} + c_{del}(s_i) \\ d_{i,j-1} + c_{ins}(t_j) \\ d_{i-1,j-1} + c_{sub}(s_i, t_j) \end{cases}$$

In addition, if the choices which lead to the minimums (i.e., the optimal decisions calculated above) are mapped,
the resulting traceback table provides the sequence of operations which will perform the transformation
needed to align and edit, or correct, the reproduced string. Figure 7 illustrates the combination edit distance/traceback
table comparing the original word "character" to the erroneously reproduced word "chanacer". The sequence
of bold-strike arrows leading from the lower right hand corner of the table to the upper left corner corresponds
to the optimal editing path. The asterisked arrows indicate the location of a deletion (the letter "t")
and a substitution (the letter "n" for "r").
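
The dynamic programming table and its traceback can be sketched in C as below. This is an illustrative sketch only: unit costs for all three operations, the `MAXLEN` bound, the function name `edit_distance` and the one-letter edit script are assumptions, and when several optimal paths exist the code simply reports the one that prefers the diagonal step.

```c
#include <string.h>

#define MAXLEN 64                            /* assumed bound on string length */

enum { C_DEL = 1, C_INS = 1, C_SUB = 1 };    /* assumed unit costs */

/* Return the edit distance between the original string s and the
 * reproduced string t, or -1 if either exceeds MAXLEN. ops (room for
 * strlen(s) + strlen(t) + 1 bytes) receives one letter per step of an
 * optimal editing path: '=' match, 's' substitution, 'd' deletion
 * (character of s missing from t), 'i' insertion (extra character in t). */
int edit_distance(const char *s, const char *t, char *ops)
{
    static int d[MAXLEN + 1][MAXLEN + 1];
    char rev[2 * MAXLEN + 1];
    int m = (int)strlen(s), n = (int)strlen(t);
    int i, j, k = 0;

    if (m > MAXLEN || n > MAXLEN)
        return -1;

    /* Fill the table using the recurrence for d[i][j]. */
    for (i = 0; i <= m; i++) d[i][0] = i * C_DEL;
    for (j = 0; j <= n; j++) d[0][j] = j * C_INS;
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            int best = d[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? 0 : C_SUB);
            if (d[i - 1][j] + C_DEL < best) best = d[i - 1][j] + C_DEL;
            if (d[i][j - 1] + C_INS < best) best = d[i][j - 1] + C_INS;
            d[i][j] = best;
        }
    }

    /* Trace back from the lower right corner to the upper left corner. */
    i = m; j = n;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? 0 : C_SUB)) {
            rev[k++] = (s[i - 1] == t[j - 1]) ? '=' : 's';
            i--; j--;
        } else if (i > 0 && d[i][j] == d[i - 1][j] + C_DEL) {
            rev[k++] = 'd';
            i--;
        } else {
            rev[k++] = 'i';
            j--;
        }
    }

    /* The path was collected backwards; reverse it into ops. */
    for (i = 0; i < k; i++)
        ops[i] = rev[k - 1 - i];
    ops[k] = '\0';
    return d[m][n];
}
```

For the words "character" and "chanacer", the script marks one substitution and one deletion, matching the two asterisked steps described for Figure 7.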

In general, there may be more than one optimal editing path through the table. Figure 8A illustrates an
alignment chart for the two sentences, or character sequences recited above. As can be seen, the correspondence
between "m" and "rn" provides for two possible interpretations, depending upon which character is chosen
for deletion and which for substitution.

As an alternative, the original, or source, text is "decimated" whereby each character of the original text,
including spaces, is mapped to a single bit in the certificate. In the context of ASCII encoding, which is fairly
common to computer-generated documents, one bit of the ASCII representation of each character can be assigned
to the certificate for that character. For example, one encoding scheme which has been reduced to practice
utilizes the next-to-lowest order bit of the ASCII encoding for each character as the certificate value for
that character. The decimation of the original text line printed above then becomes the following:

000000011011111011001000000110100000001000111.
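
The character-to-bit mapping can be sketched in C as follows. The function name `decimate` and the choice to emit the bits as the printable characters '0' and '1' are assumptions made for illustration; the bit selected is the next-to-lowest order bit named in the text.

```c
#include <stddef.h>
#include <string.h>

/* Map each character of text (spaces included) to one certificate bit:
 * the next-to-lowest order bit of its ASCII code. The bits are written
 * to out as '0' and '1' characters; out must have room for
 * strlen(text) + 1 bytes. Returns the number of bits produced. */
size_t decimate(const char *text, char *out)
{
    size_t n = strlen(text);
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (((unsigned char)text[i] >> 1) & 1) ? '1' : '0';
    out[n] = '\0';
    return n;
}
```

Since every character yields exactly one bit, the decimated sequence has the same length as the line, so an inserted or deleted character shows up as an inserted or deleted bit.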






EP 0 649 112 A2 

parison and correction is variable, depending upon the nature and frequency of expected errors, the availability 
of error correction codes and/or check-sum, and the costs (both in monetary and time constraint valuation) of 
each iteration. 

As described in our co-pending Application No. 93308020.2, the check-sum can be computed in any of a 
number of ways. For example, the "C" subroutine shown below computes a simple check-sum on a line-by- 
line basis as follows: 



    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    #define MAXLEN 200                  /* maximum input line length */

    int main(void) {
        char line[MAXLEN];              /* input line */
        unsigned char hash;             /* 8-bit hash value */
        size_t len,                     /* length of input line */
               i;                       /* counter */

        while (fgets(line, MAXLEN, stdin) != NULL) {    /* while more lines */
            line[strcspn(line, "\n")] = '\0';           /* drop the newline kept by fgets */
            len = strlen(line);                         /* get length of input line */
            if (len > 0) {                              /* if line is non-empty */
                hash = 0;                               /* initialize 8-bit hash */
                for (i = 0; i < len; i++) {             /* check each character */
                    if (!isspace((unsigned char)line[i])) {   /* if character is non-space */
                        hash ^= (unsigned char)line[i];       /* XOR ASCII value with hash */
                        hash = (unsigned char)((hash << 1) | ((hash >> 7) & 0x01));  /* left-rotate hash */
                    }
                }
                printf("%.2x\n", hash & 0xff);          /* print hash value */
            }
        }
        return 0;
    }



The ASCII value of each non-space character is exclusive-or'd with a running 8-bit check-sum. This check-sum
is then bit-rotated one position to the left, and the process is repeated with the next character in sequence.
In this case, the line "This is a test." would receive the check-sum "03" (expressed in hexadecimal notation),
which would be printed on the page in question. If, in the process of scanning, the OCR software misread the
line as "Thus is a test.", the calculated check-sum would be "73". Hence, the OCR software would detect the
presence of an error by comparing the two check-sums (one newly computed on the reproduced text and one
originally computed, printed and read from the printed certificate) and determining that they do not match. In
using this sample system, the probability that two random lines of text would have the same check-sum is 1
in 256. The eight-bit check-sum is only one example of a certificate value computation system. Those skilled

Figure 9 illustrates a representative process flow utilizing the decimation and re-alignment function,
wherein the certificate includes not only the decimation encoding, but also error correction code and check-sum
encoding. Printed page 74 is scanned, whereby two sequences of information are provided to the receiving/scanning
computer, previously referred to as computer system 2. One sequence of information received
by the scanning computer is the ASCII text (with errors) of the scanned line character sequence, at 71. The
other sequence of information is the original certificate found in the document marker, for this example containing
at least one original decimated sequence, at least one error correction code and at least one check-sum,
scanned from the printed page, at 70. The certifying software applies the decimation function to the ASCII
text of the scanned line character sequence, at 73, and provides the resulting decimated sequence for comparison
to the original decimated sequence during the re-alignment procedure illustrated at 75. One of two
alternative process paths may be followed upon completion of the first iteration of the re-alignment procedure.
A candidate corrected sequence may be provided directly for check-sum calculation and comparison, illustrated
by the line from the re-alignment procedure to box 76 at which the check-sum is calculated for the candidate
sequence. The calculated check-sum is then compared, at 77, to the original check-sum, provided by box 70.
As indicated by decision box 78, if the check-sums match, the certified ASCII text is output, or otherwise processed
as appropriate, at box 80.

If the check-sums do not match, the candidate sequence may be provided for substitution error correction
at box 72, if the answer to the question "Has substitution error correction been done?", at decision box 79, is
"No". An alternative path is to first conduct substitution error correction on the candidate sequence and either
assume a fully corrected sequence or calculate the check-sum for the corrected candidate sequence and compare
the calculated check-sum to the original check-sum. Should the check-sum for the corrected candidate
sequence not match the original check-sum, and the substitution error correction already, necessarily, have
been conducted for the given candidate, the process will return to the re-alignment step, 75, for processing of
an alternative candidate sequence. As will be apparent to one having skill in the art, the exact progression and
use of optional process steps can be altered and optimized without departing from the inventive content thereof.
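
The check-sum comparison loop at boxes 76 to 78 can be sketched in C as below. This is a simplified illustration under stated assumptions: the candidates are taken to be already generated by the re-alignment step, the optional substitution error correction at box 72 is omitted, the check-sum is the 8-bit XOR-and-rotate scheme described elsewhere in this specification, and the names `line_checksum` and `select_candidate` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* 8-bit check-sum: XOR each non-space character into a running value,
 * then rotate that value one bit to the left. */
unsigned char line_checksum(const char *line)
{
    unsigned char hash = 0;
    size_t i;

    for (i = 0; line[i] != '\0'; i++) {
        unsigned char c = (unsigned char)line[i];
        if (!isspace(c)) {
            hash ^= c;
            hash = (unsigned char)((hash << 1) | (hash >> 7));
        }
    }
    return hash;
}

/* Try each candidate sequence in turn and return the index of the first
 * one whose calculated check-sum equals the original check-sum read from
 * the marker, or -1 if none matches (the line would then be flagged). */
int select_candidate(const char *const candidates[], int count,
                     unsigned char original_sum)
{
    int i;

    for (i = 0; i < count; i++)
        if (line_checksum(candidates[i]) == original_sum)
            return i;
    return -1;
}
```

A return value of -1 corresponds to exhausting the candidates without a match; a fuller version would pass each rejected candidate through substitution error correction before moving on to the next one.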

The earlier-described difficulty of realizing and correcting line deletion and insertion errors can similarly
be addressed using edit distance processing as illustrated in Figure 10. Figure 10 is a correspondence table
comparing the original text lines and the OCR text lines. The comparison determines the edit distance between
the decimation sequence of the original text characters and the decimation sequence of the OCR reproduced
text for each line. If the lines align, the edit distance will be zero, as indicated by the "0's" located primarily
along the diagonal, assuming that there are no OCR errors in the reproduced line. If OCR errors exist, the
edit distance for two "correctly corresponding" lines will be relatively small, and low integers will be found along
the diagonal of the correspondence table. The line correspondence software can be provided with a preset
threshold value of "similarity" of lines given the known error correction capability of the available codes. However,
the edit distance between an original and an OCR line that do not correspond will, in all probability, be
quite large, as indicated by the larger integers away from the diagonal. When a high edit distance number is
encountered, the line correspondence software will compare the relevant line of original text to another, different
line of OCR reproduced text and continue to do so until it finds a reasonably corresponding line, i.e., one having
a relatively low edit distance number.

The Figure 10 table provides an illustration of the two major line errors which can be addressed using the
edit distance, line correspondence function. As the edit distance is analyzed for line 5 of the original text, it is
apparent from the absence of low integers that there is little correspondence between any of the OCR lines
and line 5 of the original text. From this analysis, it is apparent that line 5 of the original text has been omitted
from the OCR reproduced text. Examining the column corresponding to OCR line 8, it is also apparent from
the lack of low integers, that few, if any, characters in the line of OCR reproduced text correspond to the characters
of any line of the original text. The conclusion, therefore, is that line 8 of the OCR text has been erroneously
inserted, since it does not correspond with any of the lines of the original text. No previous automatic
document correcting scheme has been capable of providing this level of error correction.
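
The row and column analysis of the correspondence table can be sketched in C as follows. This is a rough illustration rather than the line correspondence software itself: it compares every original line with every OCR line instead of searching outward from the diagonal, it assumes unit edit costs, and the names `bit_edit_distance`, `find_line_errors`, `MAXBITS` and the threshold parameter are hypothetical.

```c
#include <limits.h>
#include <string.h>

#define MAXBITS 256    /* assumed maximum length of a decimated line */

/* Unit-cost edit distance between two decimation sequences given as
 * strings of '0' and '1'. Returns INT_MAX if b is longer than MAXBITS. */
int bit_edit_distance(const char *a, const char *b)
{
    static int prev[MAXBITS + 1], cur[MAXBITS + 1];
    int m = (int)strlen(a), n = (int)strlen(b), i, j;

    if (n > MAXBITS)
        return INT_MAX;
    for (j = 0; j <= n; j++)
        prev[j] = j;
    for (i = 1; i <= m; i++) {
        cur[0] = i;
        for (j = 1; j <= n; j++) {
            int best = prev[j - 1] + (a[i - 1] != b[j - 1]);
            if (prev[j] + 1 < best) best = prev[j] + 1;
            if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1;
            cur[j] = best;
        }
        memcpy(prev, cur, (size_t)(n + 1) * sizeof prev[0]);
    }
    return prev[n];
}

/* Mark deleted[i] = 1 for each original line with no OCR line within
 * threshold (a row without low integers) and inserted[j] = 1 for each
 * OCR line with no original line within threshold (a column without
 * low integers). Returns the total number of line errors found. */
int find_line_errors(const char *const orig[], int n_orig,
                     const char *const ocr[], int n_ocr,
                     int threshold, int deleted[], int inserted[])
{
    int i, j, errors = 0;

    for (i = 0; i < n_orig; i++) deleted[i] = 1;
    for (j = 0; j < n_ocr; j++) inserted[j] = 1;
    for (i = 0; i < n_orig; i++)
        for (j = 0; j < n_ocr; j++)
            if (bit_edit_distance(orig[i], ocr[j]) <= threshold) {
                deleted[i] = 0;
                inserted[j] = 0;
            }
    for (i = 0; i < n_orig; i++) errors += deleted[i];
    for (j = 0; j < n_ocr; j++) errors += inserted[j];
    return errors;
}
```

The threshold plays the role of the preset "similarity" value: it would be chosen according to how many character errors the available error correction codes are expected to handle.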

As is evident from the foregoing, the line error correction procedure utilizes the character decimation certificate
values. If line errors are to be expected, the Figure 9 process flow would ideally include a line correspondence
step prior to the character alignment performed at 75. Needless to say, if the OCR reproduced line
is not correctly aligned to the original line, and therefore not being compared to a corresponding line of the
original text, the subsequent character alignment cannot effectively be performed.

One important consideration, particularly in the case of documents with complicated structures, is determining
the canonical parsing order for computing the certificate value. Obviously, the software that calculates
the original certificate values and the OCR software must both use the same order. For layout encoding, one
linearization formula may follow a left-to-right, top-to-bottom order in the same way that English text is normally

The decimation value is incorporated into the certificate which is associated, and printed on the page, with the
original text line. Upon OCR reproduction of the line, the certifying software performs the same mapping of
characters to bits. The resulting decimation of the scanned line character sequence printed above then would
be as follows:

0000000110110111011001011000011110000000100111.

The certifying software then performs sequence alignment between the two decimated certificates to determine
possible locations of deletions, insertions and even some substitutions. Figure 8B illustrates an alignment
chart, or traceback table, for the two decimated strings. In the alignment of the decimated strings, one region
of uncertainty is broader than seen in a simple alignment of the actual character sequences. It is clear that
the decimation alignment can recognize deletions and insertions and can at least partially identify some substitution
errors.

The certifying software invokes a re-alignment algorithm for correcting the misaligned (i.e., erroneously
reproduced) scanned line character sequence. As part of the re-alignment process, the algorithm will produce
a number of partially corrected versions of the scanned line. For the scanned sentence provided above, several
of the corrected candidates may include, among others, the following character sequences:
The qUick brown fox ju*ped ov*r the laz*ydog.
The qUick brown fox ju*ped ov*r the lazy*dog.
The qUick brown fox ju*ped ov*r the lazyd*og.

The Hamming distances between the original line and the candidate lines are 5, 4 and 5 respectively. Although
the substitution error "U" for "u" was not detected for the line, due to the particular decimation function
applied, the "rn" for "m" substitution was flagged, with one extra character being deleted and an asterisk substituted
for the other, and the "c" for "e" substitution was flagged, with the erroneous character being replaced
by an asterisk. The space added in "brown" was detected and deleted, while the space deleted between "lazy"
and "dog" was identified and compensated for by inserting an asterisk. Since the re-alignment cannot precisely
locate the latter-recited deletion, three possible candidates were generated.

What is clear from the re-aligned character sequences is that, in terms of line length and character correspondence,
deletion and insertion errors have been compensated for by the re-alignment algorithm. Assuming
only insertion errors, the re-alignment procedure could, therefore, result in a 100% corrected sequence. If a
check-sum was provided in the marker, a check-sum for the corrected sequence can be calculated and compared
to the original check-sum to indicate successful correction via re-alignment.

Should re-alignment fail to produce a fully corrected sequence, the other available value or values in the
certificate for the text can be used. As noted above, the re-alignment software will generate candidate strings,
each corresponding to one minimum-cost editing path through the dynamic programming traceback table. If
the certificate additionally includes an optional check-sum for the original text, a check-sum for the re-aligned
candidate sequence can be generated for comparison to the original. Obviously, if the check-sums do correlate,
the assumption is that the re-aligned candidate sequence is "correct". If the certificate includes an error correction
code for the original text, that error correction code can be applied to the re-aligned candidate sequence.
The error correction code for the text is encoded to anticipate the expected OCR errors for the given character
set. As such, substitution errors can be readily addressed and corrected by the accompanying error correction
code.

Given a line for which the certificate contains decimation, error correction code and a check-sum, the
check-sum calculation for the decimated, aligned and substituted sequence can be conducted and the resulting
check-sum compared to the original check-sum. If the check-sums do not match, the re-alignment software
produces another possible partially corrected candidate for substitution correction and check-sum calculation
and comparison, and so forth, until a corrected reproduction of the original sequence is produced. In the rare
case that no corrected reproduction is produced, it may be necessary to flag the sequence for manual "post
processing."

As discussed in the above Background section, the error correction methods which are available for incorporation
into an OCR system include dictionary lookup search strategies, semantic or context information
codes and common error recognition codes, among others. Certificates can improve OCR recognition rates
and provide a reliable method by which users can ascertain whether or not each scanned page is error free.
As noted above, use of an error detection and correction system without knowing if intentional "errors" exist
in a document can actually cause errors to be introduced into the text. When using a certificate system of error
detection and correction, however, this can be avoided. In the instance of an intentional misspelling, for example,
the certificate system would not indicate that an error had been made, and would not therefore erroneously
correct the intentional misspelling.

read by humans. Another approach would be to decompose the page as a series of text blocks, each a separate
entity in the calculation. Any blocks containing graphics or other non-text information must be handled differently
than standard text. In the case of diagrams, recognition that a collection of "dots" corresponds to a perfect
circle is a difficult task for image processing software. If, however, the certificate generator encodes the information
that a given diagram contains 3 circles and a triangle, this information may greatly speed processing
time and increase accuracy. Beyond encoding the existence of the diagram components, the precise locations
and sizes of the basic geometric elements in the diagram could be encoded (e.g., circle radius 0.3 cm; x-coordinate
1.3 cm, y-coordinate 3.8 cm, etc.). It is further possible to adapt a certificate scheme to recognize
mathematical equations or other special typeset structures.

As noted with reference to Figure 6, it is also desirable to incorporate the document structure information
in a document marker. A 15 x 18cm (6" x 7") document having a complicated layout structure is illustrated in
Figure 4. The document, 44, contains text blocks A, B, D, E and G at 45, a photograph in block C at 48, and a
diagram in block F at 46. In order to identify the document layout to the scanning system, one layout identification
system which can be utilized is based upon a plane-slicing model, as is illustrated in Figures 5A through
5F. Other models can be utilized as appropriate. The plane-slicing model example presumes that the layout
of a document, no matter how complicated, can be described by some number of cuts. The plane slicing can
be represented recursively as a binary tree, provided the canonical ordering for the subplanes, represented
by the leaves, has been defined. The slices or planes are first identified, and characterized as specific horizontal
and vertical components, H and V, for example. Each slice is identified as a part of a tree structure.
This tree structure is then encoded as a linear string. A recursive syntax is used to yield, for the illustrated
document, the following:

(H1" CRT A (V2" (H3" CRT B (H5" PHT C CRT D)) (V4" CRT E (H3" FIG F CRT G))));

wherein each precise location is defined in inches, vertically or horizontally oriented; CRT represents the certificate
calculated for the designated text block; PHT represents the photograph; and FIG designates the diagram.
This short string then completely describes the basic layout of the document. Within the string can be
embedded additional information such as a description of the subplane contents (e.g., text, figure, equation,
photo), precise x,y coordinates of the subplane on the page, and of course the error detection/correction certificate
values.
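
The recursive encoding of the plane-slicing tree as a linear string can be sketched in C as below. This is an illustrative sketch only: the `struct plane` layout, the function name `encode_plane`, whole-inch cut positions and the exact spacing of the output are assumptions, and the leaf tags stand in for the actual certificate values.

```c
#include <stdio.h>
#include <string.h>

/* A node is either a cut ('H' or 'V' at a position in inches, with two
 * subplanes) or a leaf holding a content tag such as "CRT A". */
struct plane {
    char orient;                         /* 'H', 'V', or 0 for a leaf */
    int inches;                          /* cut position; unused for a leaf */
    const char *tag;                     /* leaf label; unused for a cut */
    const struct plane *first, *second;  /* subplanes in canonical order */
};

/* Append the linear encoding of node p to the NUL-terminated string in
 * buf (capacity cap). Returns 0 on success, -1 if buf is too small. */
int encode_plane(const struct plane *p, char *buf, size_t cap)
{
    size_t used = strlen(buf);
    int n;

    if (p->orient == 0) {                /* leaf: emit its tag */
        n = snprintf(buf + used, cap - used, "%s", p->tag);
        return (n < 0 || (size_t)n >= cap - used) ? -1 : 0;
    }

    /* cut: "(", orientation, position with inch mark, both subplanes, ")" */
    n = snprintf(buf + used, cap - used, "(%c%d\" ", p->orient, p->inches);
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    if (encode_plane(p->first, buf, cap) != 0)
        return -1;
    if (cap - strlen(buf) < 2)
        return -1;
    strcat(buf, " ");
    if (encode_plane(p->second, buf, cap) != 0)
        return -1;
    if (cap - strlen(buf) < 2)
        return -1;
    strcat(buf, ")");
    return 0;
}
```

Building the tree for the Figure 4 document (an H cut at 1" whose first subplane is text block A, and so on) and calling `encode_plane` on the root with an empty buffer yields a parenthesized string of the same shape as the one shown above. The scanning software would need the matching parser to rebuild the tree.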

To remain unobtrusive to the human reader, it is possible to 'hide' the markers in, for example, a logo. A
2 x 2cm (3/4" x 3/4") logo can encode over 1,000 bits of information. Other embodiments may include using
invisible inks or hiding the markers in the format of the document itself. It is not necessary that the information
be provided apart from the human readable portion of the document, for example in a blank margin. What is
necessary is that the information be provided in such a manner that the computer can readily ascertain its
location and easily read the encoded information; and that it not interfere with the human readable portion in
such a manner as to render it unreadable.

Since the markers are being described in terms of OCR use, it has been assumed thus far that the medium
for reading the marker is a digitizing scanner. With the development of other input media, the encoding schemes
will require adaptation to accommodate the relevant system. As with all of the preceding discussion, such modifications
as would occur to one having skill in the art may be made without departing from the spirit and scope
of the appended Claims.



Claims 

1. A method for electronically reproducing character data of a computer-generated printed document, comprising
the steps of:

assigning a plurality of first binary values to first character data, wherein one first binary value is
assigned to each one character of said first character data;

printing said data and said plurality of first binary values;

optically scanning said printed document comprising said first character data and said plurality of
first binary values to create an electronic document comprising at least one string of second characters;

assigning a plurality of second binary values to said string of second characters, wherein one second
binary value is assigned to each one second character; and

comparing said plurality of first binary values to said plurality of second binary values.

2. The method of Claim 1 further comprising the step of identifying the existence and location of second character
errors whenever said second binary value for a character is different from said first binary value
for said character.

3. The method of Claim 2 further comprising the step of automatically correcting said identified errors.

4. The method of Claim 3 wherein said correcting comprises realigning said plurality of second binary values
to said plurality of first binary values to eliminate said errors.

5. The method of any preceding Claim further comprising the step of creating at least one string of third characters
by altering said at least one string of second characters.

6. The method of any preceding Claim wherein said assigning further comprises assigning at least one first
data-dependent value to said first character data and said printing further comprises printing said at least
one first data-dependent value.

7. The method of Claim 6 wherein said assigning of at least one first data-dependent value to said first character
data comprises encoding error correction information for said first character data.

8. The method of Claim 7 further comprising applying said encoded error correction information to said at least
one string of third characters.

9. The method of any of Claims 6 to 8 wherein said assigning of at least one data-dependent value to said
first character data comprises calculating at least one first check-sum for said first character data.

10. The method of Claim 9 further comprising the steps of:

calculating at least one second check-sum for said at least one string of third characters;
comparing said at least one first check-sum to said at least one second check-sum; and
detecting the existence of at least one error in said at least one string of third characters when said
check-sums differ.

11. A method for encoding data-dependent information about a printed document having at least a plurality
of lines of printed text comprising the steps of:

decimating said printed text into a plurality of binary values; and
printing a marker comprising said plurality of encoded binary values in machine-readable symbology
on the surface of said printed document.

12. The method of Claim 11 further comprising encoding the details of said document layout and printing said
document layout encoding in said marker.

13. The method of Claim 11 or 12 further comprising assigning a plurality of substitution error correction codes
to said printed text and printing said codes in said marker.

14. The method of any of Claims 11 to 13 further comprising calculating at least one check-sum for said printed
text and printing said check-sum in said marker.

15. A marker for provision in machine-readable symbology on the surface of a printed document having lines
of printed characters, comprising:

at least one decimation sequence encoding the printed characters.

16. The marker of Claim 15 further comprising at least one check-sum calculated on said printed characters.

17. The marker of Claim 15 or 16 further comprising at least one layout dependent value calculated on the
layout of said printed document.

18. The marker of any of Claims 15 to 17 further comprising at least one error correction code related to said
printed characters.

19. A method for producing more accurate optical character recognition reproduction of documents having
lines of original printed text and at least one document marker including a plurality of first decimation sequences,
each corresponding to a sequence of said printed text, and at least one first check-sum value
calculated on said sequence of printed text, comprising the steps of:

creating an electronic document comprising a plurality of first reproduced text sequences by optically
scanning said original printed text;
optically scanning said at least one document marker;

decoding said plurality of first decimation sequences from said scanned document marker;
decimating said reproduced text into a plurality of second decimation sequences;
calculating edit distances between corresponding lines of original printed text and reproduced printed
text by comparing said first and said second decimation sequences;

comparing said edit distances and identifying line insertion and deletion errors in said reproduced
text when said edit distances differ by more than a predetermined amount;
correcting detected line insertion and deletion errors;

comparing each of said corresponding plurality of first and second decimation sequences;
identifying text errors in said reproduced text at the sequence location at which said decimation
sequences differ;

substituting different characters in said sequence locations at which text errors have been identified
to produce at least one second reproduced text sequence;

calculating a second check-sum for each of said at least one second reproduced text sequences;
comparing said second check-sum to said first check-sum; and

verifying the accuracy of said second reproduced text sequence when said first and said second
check-sums are equal.

20. The method of Claim 19 wherein said document marker further comprises a plurality of error correction
codes for said original text and further comprising the step of applying said corresponding error correction
code to said at least one second reproduced text sequence prior to calculating said second check-sum
for said text.

21. A printed document adapted to be accurately reproduced by optical scanning, comprising perceptible printed
data and at least one machine-readable marker comprising at least one decimation sequence encoding
said printed data.

22. The document of Claim 21 wherein said at least one machine-readable marker further comprises at least
one check-sum calculated on said printed data.

23. The document of Claim 21 or 22 wherein said at least one machine-readable marker further comprises
at least one layout dependent value calculated on the layout of said printed document.

24. The document of any of Claims 21 to 23 wherein said at least one machine-readable marker further comprises
at least one error correction code related to said printed data.